Applying the Munich Parametric High Definition (PHD) Speech Synthesis System to the Problem of Teaching Chinese Tones to L1-Speakers of German
Abstract
The aim of this paper is threefold. First and foremost, we propose a new strategy for phonetic speech research that differs from more traditional research programs in more than one important respect. Second, in direct opposition to the classical analysis-by-synthesis paradigm, we aim to establish the new paradigm of synthesis-by-analysis. Third, we aim to convince our audience that only strictly application-oriented phonetic speech research will lead to a deeper understanding of how speech acts really function phonetically. The paper will be given in two parts. After the presentation of the philosophy behind the new research activity named in the title, the second author will demonstrate the PHD-system in order to illustrate its applicability to the problems described by the first author.

1. The challenge: A long-term application-oriented phonetic research program

In spoken language processing (SLP), one of the most provocative challenges of speech-based man-machine communication is the development of intelligent automatic systems that teach well-motivated individual learners to speak a new foreign language fluently after as short a training time as possible.
Our proposal is inspired by the work of the pioneers in this new research field (such as Rodolfo Delmonte, Farzad Ehsani, Maxine Eskenazi and Stephanie Seneff, to mention only a few names in alphabetical order) and by already existing Spoken Language Learning Systems (such as Delmonte’s [5] or the MIT SLLS for Mandarin [4]). It also responds to the fact that the well-established research field of second language acquisition still offers little knowledge on how to reduce foreign accents [6], at least under very ‘narrow’ phonetic conditions. We have therefore decided to propose an application-oriented phonetic research strategy that starts with the following steps:

(i) use the existing technology of SLP to develop a laptop system that teaches L1-speakers of German to reproduce the tones of Mandarin Chinese;
(ii) begin with single syllables and isolated words in order to demonstrate their tones in a very clear form;
(iii) take the learner’s reproductions, analyze them and, if necessary, modify them into a corrected form;
(iv) present the original form and the corrected one, both in the learner’s own voice, so that the learner can immediately compare them and see (on the screen) and hear (on headphones) the relevant differences.

Over many such repeated sessions the learner’s brain has to develop the neural programs for recognizing and reproducing the categories that must be distinguished in the new language (cf., for instance, [3], where L1-speakers of Japanese were trained to master the /r/–/l/ distinction in prevocalic syllable position). These, of course, are only the first steps towards teaching the pronunciation of Mandarin tones to the learner. The tones of Chinese change their form as soon as the lexical items are uttered in connected speech [14].
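The analysis part of step (iii) can be illustrated with a minimal sketch, assuming the learner's fundamental frequency (F0) contour for a single syllable has already been extracted. Nothing here is taken from the PHD-system: the function names are hypothetical, the templates are simply the conventional five-level values for the four Mandarin citation tones (55, 35, 214, 51), and the speaker's pitch range is assumed to be known in advance. The sketch only scores a contour against the templates; it does not perform the resynthesis into a corrected form.

```python
import math

# Conventional five-level tone values (1 = lowest, 5 = highest) for the four
# Mandarin citation tones, used here as coarse target templates (assumption:
# three anchor points per tone are enough for illustration).
TONE_TEMPLATES = {
    1: [5, 5, 5],  # high level (55)
    2: [3, 4, 5],  # rising (35)
    3: [2, 1, 4],  # dipping (214)
    4: [5, 3, 1],  # falling (51)
}


def resample(values, n):
    """Linearly interpolate `values` to exactly `n` points (n >= 2)."""
    if len(values) == 1:
        return [float(values[0])] * n
    out = []
    for i in range(n):
        pos = i * (len(values) - 1) / (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def to_five_level_scale(f0_hz, f0_min, f0_max):
    """Map F0 values in Hz onto the 1..5 scale, using the speaker's range on a log axis."""
    span = math.log(f0_max) - math.log(f0_min)
    return [1 + 4 * (math.log(f) - math.log(f0_min)) / span for f in f0_hz]


def tone_distance(contour, tone, n=11):
    """Root-mean-square distance between a scaled contour and one tone template."""
    a = resample(contour, n)
    b = resample(TONE_TEMPLATES[tone], n)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / n)


def classify_tone(f0_hz, f0_min, f0_max):
    """Return the tone number whose template lies closest to the learner's contour."""
    contour = to_five_level_scale(f0_hz, f0_min, f0_max)
    return min(TONE_TEMPLATES, key=lambda t: tone_distance(contour, t))
```

For a learner whose range is assumed to span 100 to 200 Hz, `classify_tone([200, 170, 141.4, 120, 100], 100, 200)` selects the falling tone. The distance to the intended tone, rather than the winning label, is the quantity that a feedback display for step (iv) would most plausibly show.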
It is probably for this reason that tones have so far not been taken into account in the automatic speech recognition technology of this language. Here it becomes clear why the development of the proposed automatic system must be seen as a phonetic research tool. The correct forms have to be analyzed parametrically and properly transferred to the complex utterances in the voice of the individual learner. In this way we obtain the data collections needed to develop the phonetic theory of proper speech production in incremental steps and, at the same time, to scale and enlarge the domain of the teaching material.

Application-oriented speech research of this kind has two happy side effects. First, the application itself can be described quite naturally in ordinary language (and everybody will then be able to judge whether the application indeed works satisfactorily). This aspect is very helpful for obtaining the necessary funding. There is also a very convincing argument that anybody who has become able to pronounce the words of a new foreign language correctly will start to speak this language in a much shorter time.

Second, in order to approach such an easily describable application successfully, one has to translate the ordinary-language description into the technical terms that outline the tough problems requiring resolution by scientific work. Here the side effect is that the resulting set of problems is so rich in its complex structure that nobody could invent it without the given application. This will be discussed in the following sections.

It should be mentioned that our phonetic research proposal has been motivated by two technical developments in Munich. The first is the experience we have gained with the Munich Automatic Segmentation System (MAUS) [1].
It allowed us to reliably segment and annotate spontaneous speech in terms of the sound elements actually realized. This works because speech recognition technology is used only for the purpose of speech verification: since the text of a given utterance is known, the canonical form of its lexical items can be identified and used to derive and generate all possible, and even impossible, pronunciation variants, against which the actually given phonetic facts are then verified. The second motivation for our proposal is that all the available SLP technologies of digital speech signal processing can now be used in a controlled manner for the proper complex modification of a parametrically analyzed, naturally produced utterance [11].

2. Speech acts and utterances

Traditional speech scientists have pointed out the trivial fact that there is no natural speech act without the concrete utterance of a given speaker. They were, however, mainly interested in analyzing any given utterance with respect to the phoneme system of the speaker’s language: take Bloomfield’s first definition in his Set of Postulates [2] for the science of language, or read Trubetzkoy’s first sentence in his Principles of Phonology [13]. Future speech scientists will have to look at the great variability of different speaking styles and will have to try to give a precise answer to the question of what kinds of speech acts can be inferred from the phonetic form of any given utterance produced by a real speaker.

In the context of our research proposal we only have to distinguish, as a first step, two quite different kinds of speech acts. The distinction depends on the intention of the speaker. In the first case the speaker wants to present to the audience (or to himself) nothing but the utterance itself. In this very special case we logically get the autonymic form of an utterance.
Such an utterance is true if it demonstrates (or presents an instantiation of) a phonetically correct form of the given category. Clear speech in a dictation context is a good example of autonymically produced utterances. In normal speech situations the speaker automatically transcends the utterance he is producing in order to semantically master the concrete or abstract situation in which he is acting. In this case the utterance is logically produced in a heteronymic form: the speaker produces an utterance in an act of speech in order to express himself, without attending to its phonetic form. The distinction between the autonymic and the heteronymic use of utterances is central to our research proposal because autonymically and heteronymically used forms are pronounced quite differently.

It should also be mentioned that utterances do not really play a central role in any of the well-known philosophical theories of speech acts. Philosophical speech act theorists try to answer the question of why and how a speaker, merely by uttering “p”, can effectively express the meaning of p or even convey the truth of a proposition p. For them the production of an utterance “U” seems not to be a problem at all. We have to realize, however, that the production of an autonymically or heteronymically used utterance is still an unsolved empirical problem that can never be seriously explained by a philosophical theory alone.

3. The complexity of phonetic facts

The traditional term ‘utterance’ as used by linguists or philosophers is systematically ambiguous. It has at least two empirical meanings, because we have to take into account two different kinds of reality. First, we must take the utterance, especially in its autonymic form, as what is (or can be) directly perceived by speakers and listeners during an act of speech production.
Second, there is the physical world, which remains trans-phenomenal to the speakers and listeners of any natural speech act. It is from this second reality that the phonetic speech scientist can derive the time functions of the speech signal and store them in digitized form on a disk. Perceived utterances cannot be stored in this manner. They are always bound to a perceiving subject, but for that subject they have a category, which can either be demonstrated by another autonymic categorical reproduction or be named by means of a symbolic representation.

In logical terms, the relation between these two kinds of empirical data (categories and time functions, symbols and signals) is not an analytical one. Because they are logically independent of each other and given as contiguous facts, we may say that they are empirically related in a very strong fashion: the categories of a speech database can be experimentally reproduced simply by reproducing the time function that belongs to the given utterance.

Against this theoretical background we are now in a position to introduce the concept of phonetic facts. A phonetic fact is the utterance (produced by a speaker in a speech act) that has a certain category and an empirically verifiable time function. Both can be stored in a spoken language database as a categorically annotated digital speech signal.

From a cognitive point of view we can say that during any speech act the speaking nervous system produces at its periphery very complex speech movements. These movements, only partly visible, are observed by the sensory systems of speakers and listeners in order to identify the complex category of the observable phonetic facts that establish the given speech act, be it phonetically autonymic or heteronymic. Any act of speech is complex, even if the speaker produces only a single speech sound or the citation form of a short word.
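The notion of a phonetic fact introduced above, a category paired with a time function and stored as a categorically annotated digital signal, can be pictured as a small data structure. The following is only a sketch under stated assumptions: the class names, the field names and the pinyin-plus-tone-number labels such as "ma1" are hypothetical and do not describe the format of any actual Munich database, and the silent sample vectors stand in for real recordings.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PhoneticFact:
    """One utterance: a perceived category plus its measured time function."""

    category: str               # symbolic annotation (hypothetical label scheme)
    samples: Tuple[float, ...]  # digitized time function of the speech signal
    sample_rate_hz: int         # sampling rate used when digitizing

    @property
    def duration_s(self) -> float:
        """Length of the stored signal in seconds."""
        return len(self.samples) / self.sample_rate_hz


class SpokenLanguageDatabase:
    """Toy in-memory store of categorically annotated signals."""

    def __init__(self) -> None:
        self._facts: List[PhoneticFact] = []

    def add(self, fact: PhoneticFact) -> None:
        self._facts.append(fact)

    def by_category(self, category: str) -> List[PhoneticFact]:
        """Return every stored signal annotated with `category`."""
        return [f for f in self._facts if f.category == category]


# Placeholder entries: silent signals standing in for real recordings.
db = SpokenLanguageDatabase()
db.add(PhoneticFact(category="ma1", samples=(0.0,) * 16000, sample_rate_hz=16000))
db.add(PhoneticFact(category="ma4", samples=(0.0,) * 8000, sample_rate_hz=16000))
```

Keeping the category and the signal in one immutable record mirrors the point made in the text: the symbol and the time function are logically independent, yet retrieving by category always yields a signal whose playback would reproduce that category.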
The demonstration of a cardinal vowel produced with a certain tone of a certain tone language is likewise, by itself, a very complex action. If such a nearly elementary categorical autonymic action is integrated into an utterance of much greater complexity, we get functional variability. Functional variability in a naturally produced CVC stream is controlled by different prosodies, which depend on the act of speech in a given semantic and pragmatic situation. We do not yet know much about the rules that determine the complex structure of the functional variability of phonetic facts. (So again: future speech scientists will have to find a precise answer to the question of what different kinds of speech acts can be inferred from the phonetic form of a given utter-